140 research outputs found

    Building diachronical reference corpora for the French language

    This paper deals with the problems of creating diachronic reference corpora of the French language and with the solutions proposed by several interconnected projects: the Representative Corpus of the First French Texts (CoRPTeF), the Big Historical Grammar of French (GGHF) and the Evolution of the French Prepositional System (PRESTO). Many of these problems are relevant to a reference corpus project for any language that covers a broad diachronic span.

    Développement de la Base de français médiéval : qualité philologique, ouverture et outillage textométrique

    This paper presents three key aspects of the development of the Base de Français Médiéval (BFM) Old French text corpus: the quality of the data for research, an open distribution policy for texts and software, and the improvement of tools for reading, searching and analyzing the texts of the corpus.

    A Tag For Punctuation

    In this paper I will argue that it may be useful to introduce a special tag for punctuation marks, which could join the TEI Analysis module alongside its other "segLike" elements. I will first discuss the reasons why punctuation marks may need tagging, and then consider the existing TEI tags that might be used for that purpose. None of them appears to be perfect for the job. After discussing the linguistic properties of punctuation marks, I will propose a tentative formal definition for the new element.

    From the Holy Grail to the Good Health: a Digital Edition of a 15th Century French Medical Treatise on the BFM Web Portal

    This paper presents a project for a digital edition of the medical treatise entitled L'enseignement ou la manière de garder et conserver la santé (Treatise on the preservation of health), a Middle French translation of the Latin work by Guido Parato entitled Libellus de sanitate conservanda (1459). The edition is based on the manuscript St. Petersburg, Russian National Library, Fr.Q.v.VI.1. The methodological principles and technological solutions used in the edition were developed in the "Quest of the Holy Grail" digital edition project, and the edition is intended to join the future Base de Français Médiéval digital library collection (BFM, http://txm.bfm-corpus.org).

    The TXM Portal Software giving access to Old French Manuscripts Online

    Full text online: http://www.lrec-conf.org/proceedings/lrec2012/workshops/13.ProceedingsCultHeritage.pdf This paper presents the new TXM software platform, which gives online access to images of Old French manuscripts and to their tagged transcriptions for concordancing and text mining. The platform can import medieval sources encoded in XML according to the TEI Guidelines, with manuscript images linked to transcriptions and with several diplomatic levels of transcription encoded, including abbreviations and word-level corrections. It includes a sophisticated tokenizer able to deal with TEI tags at different levels of the linguistic hierarchy. Words are tagged on the fly during the import process using the IMS TreeTagger tool with a specific language model. Synoptic editions displaying manuscript images and text transcriptions side by side are produced automatically during the import process. Texts are organized in a corpus with their own metadata (title, author, date, genre, etc.), and several word-property indexes are produced for the CQP search engine so that word patterns can be searched efficiently to build different types of frequency lists or concordances. For syntactically annotated texts, special indexes are produced for the TigerSearch engine so that syntactic concordances can be built efficiently. The platform has also been tested on classical Latin, ancient Greek, Old Slavonic and Old Hieroglyphic Egyptian corpora (including various types of encoding and annotation).
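
    As a rough illustration of what reading several transcription levels from a TEI source can involve, the sketch below extracts an abbreviated and an expanded form for each word. It is not TXM's import code: the sample text is invented, the function name word_forms is hypothetical, and the TEI namespace is omitted for brevity. Only the standard TEI element names w, choice, abbr and expan are assumed.

```python
import xml.etree.ElementTree as ET

# Invented sample: one word carries an abbreviation and its expansion.
SAMPLE = """<p>
  <w>li</w>
  <w><choice><abbr>chr</abbr><expan>chevalier</expan></choice></w>
  <w>vint</w>
</p>"""


def word_forms(xml_text):
    """Return one (abbreviated, expanded) pair per <w> element."""
    root = ET.fromstring(xml_text)
    pairs = []
    for w in root.iter("w"):
        choice = w.find("choice")
        if choice is None:
            # No alternative readings: both levels share the same form.
            form = "".join(w.itertext()).strip()
            pairs.append((form, form))
        else:
            abbr = "".join(choice.find("abbr").itertext()).strip()
            expan = "".join(choice.find("expan").itertext()).strip()
            pairs.append((abbr, expan))
    return pairs


if __name__ == "__main__":
    for abbreviated, expanded in word_forms(SAMPLE):
        print(abbreviated, expanded)
```

    A real TEI file declares the namespace http://www.tei-c.org/ns/1.0, so the element lookups would need to be qualified accordingly.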

    Approche quantitative des marques graphiques et lexicales de l’oral représenté à travers les corpus BFM et BVH

    This article presents an experiment in using digital corpora, the Base de français médiéval and the Bibliothèques virtuelles humanistes, to study the graphical and lexical marking of represented speech and of quotations. Depending on the editing methods and the state of the markup, different queries can be applied to the corpora as a whole or to parts of them. The indicators studied include the relative frequency of the verb dire ('to say'), the frequency of the different punctuation marks used at the boundaries between the elements that make up episodes of represented speech, and the markers of citation of written sources.
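
    The first indicator mentioned above, relative frequency, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' method: it assumes the text is already available as a list of lemmas, and the function name and the per-1000-tokens scale are hypothetical choices.

```python
from collections import Counter


def relative_frequency(lemmas, target, per=1000):
    """Occurrences of `target` per `per` tokens; 0.0 for an empty list."""
    if not lemmas:
        return 0.0
    return Counter(lemmas)[target] * per / len(lemmas)


if __name__ == "__main__":
    # Invented toy sample of lemmas, for illustration only.
    sample = ["il", "dire", "que", "dire"]
    print(relative_frequency(sample, "dire"))
```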

    Ponctuation française du Moyen Âge au XVIe siècle : théories et pratiques

    This paper reviews the theories of punctuation that circulated in France from the earliest writings to the 16th century, in the form of treatises or of remarks in grammars, and compares these theoretical recommendations with the practice of French scribes and printers. Particular attention is paid to the changes in punctuation practice linked to the development of printing at the end of the 15th century and in the 16th century. For the study of practice, the "BFM - Manuscrits" corpus (http://bfm.ens-lyon.fr/article.php3?id_article=177) is used for the medieval period and the Base Epistemon (http://www.bvh.univ-tours.fr/Epistemon/index.asp) for the 16th century.

    Syntactic Reference Corpus of Medieval French (SRCMF)

    The aim of the SRCMF project is to mark up syntactic structures in the texts of two major Old French corpora, the Base de Français Médiéval and the Nouveau Corpus d'Amsterdam. The annotation model is dependency-based. The texts are first marked up manually; this annotation is then used as a "gold standard" to train automatic parsers. The corpus can currently be searched using the TigerSearch software. Project documentation will be published on the web as soon as it is stable, and access to the corpus will be provided to researchers upon reasoned request.

    Метод структурных схем компьютерного морфологического анализа словоформ естественного языка

    http://mech.math.msu.su/~fpm/ps/k14/k143/k14303.pdf This paper proposes a computational model for the morphological analysis of languages in which word formation and inflection rely heavily on affixation. The main idea is to define structural patterns of words together with the corresponding lists of suffixes. First, a detailed description is given of a stemming algorithm, of its modification, and of the technique for determining the grammatical characteristics of word-forms. The next part of the work focuses on applying the proposed algorithms to the French language. Finally, some results of running these algorithms are provided.
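
    The general idea of pairing suffix lists with grammatical characteristics can be sketched with a plain longest-suffix match. This is not the paper's algorithm, whose structural patterns are not reproduced in the abstract: the suffix table, the minimum stem length and all names below are hypothetical and serve only as an illustration.

```python
from typing import NamedTuple, Optional


class Analysis(NamedTuple):
    stem: str
    suffix: str
    features: str


# Hypothetical, illustrative suffix list for a few French verb endings.
SUFFIXES = {
    "ons": "verb, 1st person plural, present",
    "ez": "verb, 2nd person plural, present",
    "ent": "verb, 3rd person plural, present",
    "er": "verb, infinitive",
}

# Assumed threshold: reject analyses that leave a stem shorter than this.
MIN_STEM = 2


def analyse(word: str) -> Optional[Analysis]:
    """Strip the longest known suffix and return the stem with its features."""
    word = word.lower()
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        stem = word[: len(word) - len(suffix)]
        if word.endswith(suffix) and len(stem) >= MIN_STEM:
            return Analysis(stem, suffix, SUFFIXES[suffix])
    return None


if __name__ == "__main__":
    for form in ["parlons", "chantez", "et"]:
        print(form, analyse(form))
```

    A bare suffix match over-generates (for example, it would treat any word ending in -ent as a verb form), which is one reason a real analyser needs further constraints on word structure.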

    Specifying a TEI-XML Based Format for Aligning Text to Image at Character Level

    This paper presents an experience of specifying and implementing an XML format for text-to-image alignment at word and character level within the TEI framework. The format is a supplementary markup layer applied to heterogeneous transcriptions of medieval Latin and French manuscripts encoded using different "flavors" of the TEI (normalized for critical editions, diplomatic or palaeographic transcriptions). One of the problems that had to be solved was identifying "non-alignable" spans in the various kinds of transcriptions. Originally designed within a research project on the ontology of letter-forms in medieval Latin and vernacular (mostly French) manuscripts and inscriptions, this format can be of use to any project that involves fine-grained alignment of transcriptions with zones on digital images.